AITopics | feedback and unknown transition function

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Neural Information Processing Systems | Dec-25-2025, 19:33:19 GMT

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta)$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function. To our knowledge these are the first algorithms that in our setting handle both bandit feedback and an unknown transition function.
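To get a feel for how the two bounds in the abstract above compare, the following minimal sketch evaluates both expressions with constants and logarithmic factors dropped (an assumption, since the $\tilde{O}$ notation hides them). The function names and any sample values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def bound_with_reachability(L, X, A, T, beta):
    """L * |X| * sqrt(|A| * T) / beta, ignoring constants and log factors."""
    if not 0 < beta <= 1:
        raise ValueError("beta must lie in (0, 1]")
    return L * X * math.sqrt(A * T) / beta

def bound_general(L, X, A, T):
    """L^(3/2) * |X| * |A|^(1/4) * T^(3/4), ignoring constants and log factors."""
    return L ** 1.5 * X * A ** 0.25 * T ** 0.75

def crossover_beta(L, A, T):
    """Value of beta at which the two simplified expressions coincide.

    Setting the two expressions equal and solving for beta gives
    beta = (|A| / T)^(1/4) / sqrt(L).
    """
    return (A / T) ** 0.25 / math.sqrt(L)
```

Under this simplification, the first expression is the smaller of the two whenever $\beta$ exceeds `crossover_beta(L, A, T)`, and that threshold shrinks as $T$ grows. Both expressions grow sublinearly in $T$, which is what makes the average regret per episode vanish.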

bandit feedback, feedback and unknown transition function, online stochastic shortest path, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Aviv Rosenberg, Yishay Mansour

Neural Information Processing Systems | Oct-3-2025, 08:46:10 GMT

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes.

algorithm, bandit uc-o-rep, transition function, (12 more...)

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > Quebec > Montreal (0.04)
(2 more...)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)

Reviews: Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Neural Information Processing Systems | Jan-26-2025, 03:17:37 GMT

The submission studies adversarial online learning in episodic loop-free Markov decision processes. The work is important because it is the first to provide an understanding of an adversarial online learning problem in which the transition function is unknown, the loss functions change, and only bandit feedback is observed. The related work section clearly describes the progression of this research area, from settings with a fixed unknown transition function and an unknown loss function to the setting studied in this submission. Although the MDPs considered in the submission are L-layered and loop-free, the results and the analysis pave the way for general MDPs. The main idea is the design of confidence sets that include the optimal occupancy measure, which induces the optimal policy.
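The closing remark of the review above, that an occupancy measure induces a policy, can be illustrated with a minimal sketch. It assumes the occupancy measure is stored as a dictionary mapping hypothetical `(state, action, next_state)` triples to probabilities; this representation and the function name are illustrative assumptions and are not taken from the submission.

```python
from collections import defaultdict

def policy_from_occupancy(q):
    """Return pi[x][a] = q(x, a) / q(x) for every visited state x.

    q(x, a) is the occupancy mass summed over next states, and q(x) is the
    same mass summed over actions as well.
    """
    state_action = defaultdict(float)
    state_total = defaultdict(float)
    for (x, a, _x_next), prob in q.items():
        state_action[(x, a)] += prob
        state_total[x] += prob

    policy = defaultdict(dict)
    for (x, a), mass in state_action.items():
        if state_total[x] > 0:  # states with zero mass get no entry in this sketch
            policy[x][a] = mass / state_total[x]
    return dict(policy)
```

For a state that carries positive occupancy mass, the resulting action probabilities sum to one by construction. How to treat states with zero mass is a design choice this sketch leaves open.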

feedback and unknown transition function, online stochastic shortest path, submission, (8 more...)

Industry: Education (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.76)

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Neural Information Processing Systems | Oct-10-2024, 15:26:33 GMT

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta)$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function.

bandit feedback, feedback and unknown transition function, online stochastic shortest path, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Aviv Rosenberg, Yishay Mansour

Neural Information Processing Systems | Mar-18-2020, 21:16:48 GMT

We consider online learning in episodic loop-free Markov decision processes (MDPs), where the loss function can change arbitrarily between episodes. The transition function is fixed but unknown to the learner, and the learner only observes bandit feedback (not the entire loss function). For this problem we develop no-regret algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability $\beta > 0$ under any policy, we give a regret bound of $\tilde{O} ( L|X|\sqrt{|A|T} / \beta)$, where $T$ is the number of episodes, $X$ is the state space, $A$ is the action space, and $L$ is the length of each episode. When this assumption is removed we give a regret bound of $\tilde{O} ( L^{3/2} |X| |A|^{1/4} T^{3/4})$, that holds for an arbitrary transition function.

bandit feedback, feedback and unknown transition function, online stochastic shortest path, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)
